21 research outputs found

    Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics

    We address the problem of video representation learning without human-annotated labels. While previous efforts address the problem by designing novel self-supervised tasks using video data, the learned features are merely frame-by-frame and therefore not applicable to many video analytic tasks where spatio-temporal features prevail. In this paper we propose a novel self-supervised approach to learn spatio-temporal features for video representation. Inspired by the success of two-stream approaches in video classification, we propose to learn visual features by regressing both motion and appearance statistics along spatial and temporal dimensions, given only the input video data. Specifically, we extract statistical concepts (fast-motion region and the corresponding dominant direction, spatio-temporal color diversity, dominant color, etc.) from simple patterns in both spatial and temporal domains. Unlike prior puzzles that are hard even for humans to solve, the proposed approach is consistent with inherent human visual habits and therefore easy to answer. We conduct extensive experiments with C3D to validate the effectiveness of our proposed approach. The experiments show that our approach can significantly improve the performance of C3D when applied to video classification tasks. Code is available at https://github.com/laura-wang/video_repres_mas.
    Comment: CVPR 201

    Self-supervised Video Representation Learning by Pace Prediction

    This paper addresses the problem of self-supervised video representation learning from a new perspective: video pace prediction. It stems from the observation that the human visual system is sensitive to video pace, e.g., slow motion, a widely used technique in filmmaking. Specifically, given a video played at its natural pace, we randomly sample training clips at different paces and ask a neural network to identify the pace of each video clip. The assumption here is that the network can only succeed in such a pace reasoning task when it understands the underlying video content and learns representative spatio-temporal features. In addition, we further introduce contrastive learning to push the model towards discriminating different paces by maximizing the agreement on similar video content. To validate the effectiveness of the proposed method, we conduct extensive experiments on action recognition and video retrieval tasks with several alternative network architectures. Experimental evaluations show that our approach achieves state-of-the-art performance for self-supervised video representation learning across different network architectures and different benchmarks. The code and pre-trained models are available at https://github.com/laura-wang/video-pace.
    Comment: Correct some typos; update some concurrent works accepted by ECCV 202
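    The clip sampling step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sample_pace_clip`, the clip length of 16 frames, and the pace set (1, 2, 3, 4) are assumptions made for illustration, and the sketch models a pace only as a frame stride (faster playback), leaving out slow motion.

```python
import random


def sample_pace_clip(num_frames, clip_len=16, paces=(1, 2, 3, 4), rng=random):
    """Pick a pace at random and return (frame_indices, pace_label).

    Hypothetical helper: a pace p is modelled as taking every p-th frame.
    The label is the index of the chosen pace, i.e. the class a network
    would be asked to predict.
    """
    # Keep only the paces whose clip fits inside the video.
    valid = [i for i, p in enumerate(paces)
             if (clip_len - 1) * p + 1 <= num_frames]
    if not valid:
        raise ValueError("video too short for any candidate pace")

    label = rng.choice(valid)
    pace = paces[label]

    # Number of source frames covered by the clip at this pace.
    span = (clip_len - 1) * pace + 1
    start = rng.randrange(num_frames - span + 1)

    indices = [start + k * pace for k in range(clip_len)]
    return indices, label


if __name__ == "__main__":
    frame_indices, pace_label = sample_pace_clip(num_frames=120)
    print(pace_label, frame_indices)
```

    The returned indices select the frames fed to the network, and the label is the classification target for the pace prediction task.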

    Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics

    This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc. Then a neural network is built and trained to yield the statistical summaries given the video frames as inputs. To alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing content in the visual field, and only needs impressions of rough spatial locations to understand the visual content. To validate the effectiveness of the proposed approach, we conduct extensive experiments with four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D and S3D-G. The results show that our approach outperforms the existing approaches across these backbone networks on four downstream video analysis tasks: action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at: https://github.com/laura-wang/video_repres_sts.
    Comment: Accepted by TPAMI. An extension of our previous work at arXiv:1904.0359
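    The idea of encoding a rough spatial location instead of exact coordinates can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the uniform `grid` x `grid` partition is only one possible partitioning pattern, the function name `largest_motion_block` is hypothetical, and the input is assumed to be a precomputed per-pixel motion magnitude map (for example, derived from optical flow).

```python
def largest_motion_block(magnitude, grid=3):
    """Return the row-major index of the grid block with the most motion.

    magnitude: 2D list (height rows x width columns) of non-negative
    per-pixel motion magnitudes. The frame is split into grid x grid
    blocks (an assumed partitioning pattern), and the block with the
    largest summed magnitude serves as the rough location label.
    """
    height, width = len(magnitude), len(magnitude[0])
    totals = [0.0] * (grid * grid)

    for y, row in enumerate(magnitude):
        block_row = y * grid // height
        for x, value in enumerate(row):
            block_col = x * grid // width
            totals[block_row * grid + block_col] += value

    # Ties are resolved in favour of the lowest block index.
    return max(range(grid * grid), key=totals.__getitem__)


if __name__ == "__main__":
    motion = [[0.0] * 6 for _ in range(6)]
    motion[4][1] = 5.0  # strong motion in the bottom-left region
    print(largest_motion_block(motion))
```

    A network trained on such labels predicts a block index rather than regressing pixel coordinates, which matches the abstract's stated motivation of reducing the learning difficulty.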

    Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training

    This work aims to improve unsupervised audio-visual pre-training. Inspired by the efficacy of data augmentation in visual contrastive learning, we propose a novel speed co-augmentation method that randomly changes the playback speeds of both audio and video data. Despite its simplicity, the speed co-augmentation method possesses two compelling attributes: (1) it increases the diversity of audio-visual pairs and doubles the size of negative pairs, resulting in a significant enhancement in the learned representations, and (2) it changes the strict correlation between audio-visual pairs but introduces a partial relationship between the augmented pairs, which is modeled by our proposed SoftInfoNCE loss to further boost the performance. Experimental results show that the proposed method significantly improves the learned representations when compared to vanilla audio-visual contrastive learning.
    Comment: Published at the CVPR 2023 Sight and Sound workshop

    View-Invariant Human Action Recognition Based on a 3D Bio-Constrained Skeleton Model


    A Balanced Heuristic Mechanism for Multirobot Task Allocation of Intelligent Warehouses

    This paper presents a new mechanism for the multirobot task allocation problem in intelligent warehouses, where a team of mobile robots is expected to efficiently transport a number of given objects. We model the system with unknown task cost, and the objective is twofold: allocating the workload equally and minimizing the travel cost. A balanced heuristic mechanism (BHM) is proposed to achieve this goal. We derive two improved task allocation methods by applying this mechanism to the auction and clustering strategies, respectively. The results of simulated experiments demonstrate the success of the proposed approach in increasing the utilization of the robots as well as the efficiency of the whole warehouse system (by 5% to 15%). In addition, the influence of the coefficient α in the BHM is studied in detail. Typically, this coefficient is set between 0.7 and 0.9 to achieve good system performance.

    Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics

    Engineering and Physical Sciences Research Council; National Natural Science Foundation of China; Chinese University of Hong Kong